ABSTRACT


This thesis involves the detection, recovery and/or correction of
errors in XPL defined languages. XPL is a compiler generating system
based on a (1,1) bounded context parser using (2,1) context to resolve
conflicts in the grammar, and an analyzer which produces tables from a
BNF description of the grammar for the language. The areas of spelling
errors and errors caused by insertion/deletion are covered. Routines
for correcting spelling errors in an ALGOL-like language are presented.
An expanded syntax analyzer which aids in the production of a data base
used by the compiler to correct insertion/deletion errors is also
presented. Ideas for implementing this data base in XPL compilers,
using heuristics to decrease the size of the insertion sets, are also
presented.


I. INTRODUCTION


Consider the following two programs, one in an ALGOL-like language
and the other in a FORTRAN-like language.

     BEGIN
        INTEGER SIDE1, SIDE2, ROW1, ROW2;
        READ (SIDE1, SIDEI, ROW1, ROW2);
        SIDE1:=SIDE2*ROWU+ROW2;
        IF (SIDE1 = SIDE2) THEN WRITE (SIDE1);
        FOR I:=SIDE1 UNTL SIDE2 DO
        BEGIM
           WRITE (SIDE1-SIDE2);
           SIDE2=ROWU-SIDE1;
        EMD;
        WRITE (ROW1, ROW2, SIDE1, SIDE2);
     END.

           READ (5, 100) SIDE1, SIDE2, ROW1, ROW2
           SIDE1=SIDE2*ROWU+ROW2
           IF (SIDE1.EQ.SIDE2) WRITE (6, 101) SIDE1
           DO 10 I=SIDE1, SIDE2
           X=SIDE1-SIDE2
           WRITE (6, 102) X
           SIDE2 :=ROWU+SIDE1
      10   CONTINU
           WRITE (6, 103) ROW1, ROW2, SIDE1, SIDE2
      100  FORMAT ******
      101  FORMAT ******
      102  FORMAT ******
      103  FORMAT ******
           STOP
           EMD

Both are
simple programs, but would needlessly fail to execute due to
errors in keypunching. Both programs have identical errors involving
the misspelling of keywords (reserved words) as well as identifiers.
Both have ample local context to determine the nature of the
misspellings as well as to correct the errors, all of which are of the
kind which this expanded XPL system is designed to handle. The
ALGOL-like language is rich in information which could be used to
correct the misspellings, while the FORTRAN-like language has sufficient
information but not in the easily accessible form of the ALGOL-like
language. FORTRAN-like languages are characterized by a minimum of
predeclaration, normally only arrays and the preliminary type
conventions, while ALGOL-like languages require the predeclaration of
all identifiers except labels.

When considering errors one can divide them into three basic
types: 1) simple errors, e.g. misspelled identifiers, single insertions
or single deletions, which can be corrected with no effect on the code
generation or the execution of the program; 2) errors which can be
corrected but may affect the execution of the program, e.g. misspelled
identifiers with multiple possible replacements, or insertions where
multiple insertions were possible but heuristic reduction has produced
a single insertion (for instance, the set of arithmetic operators was
reduced to plus only), so that parsing continues correctly but the
execution may be erroneous; and 3) errors which are so severe that the
compiler cannot correct them. In these cases the only alternative is
the deletion of some text and the termination of code emission or the
marking of the code as unexecutable.

All of the errors in the two initial examples were of type one. 
These errors lead to needless resubmission and frustration on the part 
of the student or programmer who submitted them. Among the simple 
spelling errors in these samples are ROWU (ROW1), BEGIM (BEGIN), SIDEI
(SIDE2), EMD (END), and CONTINU (CONTINUE). 

Now let us consider what is involved in the more complex problem of
correcting errors not related to spelling. The first program contains
an example of an error requiring a simple insertion, the colon missing
between SIDE2 and the equal sign in line eight. The second contains an
example requiring a simple deletion, the colon in line seven. With a
reasonable amount of effort on the part of the compiler designer these
programs could have run at first submission and the execution could
have been error free. Consider the following examples:

        LEFT = RIGHT * CORNER          ALGOL
        LEFT := RIGHT * CORNER         FORTRAN

These particular cases are also type one errors and are most frustrating
to the programmer because they are obvious to anyone with any knowledge
of the two languages. The statements simply require the insertion or
deletion of a colon between the first identifier and the equal sign.

The following error is an example of type two errors, those
where local context is not sufficient to correct the error with
absolute certainty:

        FOR I:= UNTIL N DO ...         ALGOL
        DO 10 I=,N                     FORTRAN

These two statements could not be parsed reasonably by any existing
compiler. Compilers could, however, use heuristics to decide upon a
"reasonable" insertion, make the insertion, and emit an error message.
In both cases a default lower limit of one could be established and
inserted in the proper position, or the default insertion for a
number could be inserted, and parsing could continue.


Consider:

        INTEGER A B,C,D                ALGOL
        INTEGER*4 A B,C,D              FORTRAN

These two statements require more complex heuristics involving either
the insertion of a comma between the A and B or the deletion of the blank
and creation of the single identifier AB.

The final class of error is characterized by two parts of the text 
which do not belong together. Common examples of this error are extra 
cards in a program or cards out of place, e.g. ELSE A:=B followed by 
IF A LSS B THEN B:=A. 

The basis of this thesis is the XPL compiler generating system,
SKELETON. This system is based on a (1,1) bounded context parser using
(2,1) context to resolve conflicts in the grammar, and an analyzer which
produces parsing tables from a BNF (Backus Naur Form) description of
the grammar for the language.

There are two basic methods for recovering from an error: 1) use
semantics and local context to correct the error as best possible; and
2) delete text until parsing can continue. It is this second method
which is used by the XPL system and which is the least desirable, since
it totally eliminates the possibility of a correct parse and execution.
XPL presently recognizes errors in two ways: 1) "ILLEGAL SYMBOL PAIR",
caused by the appearance of two terminals together for which no stacking
decision exists; and 2) "NO PRODUCTION APPLICABLE", a condition occurring
when local context requires a reduction but none can be made. This
type of system keys on end symbols such as ";" or END and normally
deletes input text until one of the key symbols is encountered. Such
systems also key on some stack symbols such as <block body> or
<statement list> and delete useless material from the stack, although
XPL in its present form does this only to a very limited degree. This
error recovery technique allows, almost forces, errors to cascade down
through good text, thus producing multiple error messages.
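
The text-deletion style of recovery described above can be illustrated
with a minimal Python sketch. It is not XPL's own code: the token list,
the KEY_SYMBOLS set and the function name skip_to_key_symbol are all
assumptions made only for illustration.

```python
# Hypothetical sketch of recovery by text deletion; not taken from XPL.
KEY_SYMBOLS = {";", "END"}  # assumed "end symbols" that the recovery keys on


def skip_to_key_symbol(tokens, error_pos, key_symbols=KEY_SYMBOLS):
    """Discard tokens from error_pos up to the next key symbol.

    Returns (deleted_tokens, resume_pos); resume_pos indexes the key
    symbol, or equals len(tokens) if no key symbol remains.
    """
    pos = error_pos
    while pos < len(tokens) and tokens[pos] not in key_symbols:
        pos += 1
    return tokens[error_pos:pos], pos


tokens = ["FOR", "I", ":=", "SIDE1", "UNTL", "SIDE2", "DO",
          "WRITE", "(", "I", ")", ";"]
# Suppose the error is detected at the misspelled UNTL (index 4).
deleted, resume = skip_to_key_symbol(tokens, 4)
```

In this sketch the perfectly good WRITE (I) is discarded together with
the misspelled word, which is the loss of good text that the first
method tries to avoid.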

The first method, on the other hand, uses local context if possible
to correct the error and continue parsing. This can be done by checking
spelling if identifiers or reserved words are present, or by trying to
determine if a symbol is missing or if an extra symbol has been inserted.

Consider figure 1, a simple program designed to test this system.
The program contains only minor spelling errors. The ALGOL-E compiler
[6], on which the BALGOL test compiler was based, using standard XPL
error recovery techniques produced 19 errors (9 illegal symbol pairs,
9 undefined identifiers, 1 invalid for loop index, etc.) and in fact
failed to detect all of the errors due to text deletion. The same
program using the system presented here detected 23 errors and executed
normally (see figure 2).

The problem of making reasonable corrections leads us into the fields 
of artificial intelligence and pattern recognition. For example, 
spelling correction is an application of the concepts which have been 
developed in the area of pattern recognition. Many of these techniques 
are well suited to the problem of compiler design. The use of heuristics 
in compilers to correct other errors requires the building of a data base 
from which the compiler can draw information when it reaches a point 
where parsing of the program cannot continue. The present parsing 
tables are such a data base. Optimally this data base must be as small
as possible as well as organized in such a manner that it is easily
accessible. The data base thus becomes the basis for the correction of
errors other than simple misspellings.

The following is a discussion of: 1) a set of spelling correction 
routines implemented in an XPL compiler, 2) a method for correcting 
errors caused by insertions and deletions, 3) a new syntax analyzer which 
provides information useful in handling insertions and deletions, and 
4) some suggestions for implementation of this technique in XPL based
compilers.


Figure 2. Test Program, BALGOL Compiler



II. SPELLING CORRECTION


A. EARLY WORK 

One of the most obvious beginning points for the detection and
correction of errors in compilers is the correction of spelling errors
in both user defined identifiers and reserved words. This beginning
point is natural since one of the most common situations to arise is
<IDENTIFIER><IDENTIFIER>, an "ILLEGAL SYMBOL PAIR". Spelling correction
is clearly related to some of the initial work in the area of pattern
matching done in the late 50's and early 60's. Works in this area
include those of Cyril Alberga on string similarity and misspelling and
Fred Damerau's work on computer detection and correction of spelling
errors.

The two major efforts in the area were: 1) Charles Blair's program
for correcting spelling errors, and 2) Freeman's PhD dissertation at
Cornell, in which he worked with error correction in CORC, the Cornell
Computing Language.

1. Blair's Work

Blair's work [2] was developed as a simple, heuristic procedure 
for the correction of simple spelling errors. The program uses a dic- 
tionary of correct spellings and the idea of similarity to match 
misspellings with their corresponding dictionary word. 

The algorithm computes an abbreviation for every dictionary word,
computes an abbreviation for any misspelling in the same way, and bases
its word matching on the abbreviations. The idea of the abbreviation is
to isolate the "kernel" of the word, the most important letters which
uniquely describe the word.

The abbreviation is computed on the basis of importance values
assigned to the various letters of the alphabet and a position factor
assigned to the various positions in the word. Each letter of the word
is assigned a value by summing these two factors. The abbreviation is
the "n" letters of the word with the lowest value, where "n" is the
length of the abbreviation desired. The following is an example of
length 4:


        ABSORBENT                                ABSORBANT
        (rows of the worked example: letter score, position score,
         sum, deletions)
        ABBT              Abbreviation           ABBT
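
The abbreviation scheme can be sketched in Python as follows. This is
only an illustration of the idea: Blair's actual letter and position
values are not reproduced here, so letter_score and position_score
below are invented assumptions, and the abbreviations they yield
differ from the ABBT of the example above.

```python
# Illustrative scores only: both scoring functions are assumptions and
# are NOT the values used in Blair's program.
VOWELS = set("AEIOU")


def letter_score(ch):
    """Assumed importance value: vowels are the first letters dropped."""
    return 5 if ch in VOWELS else 2


def position_score(index, length):
    """Assumed position factor: low at the ends, high in the middle."""
    return min(index, length - 1 - index)


def abbreviation(word, n=4):
    """Keep the n letters with the lowest summed score, in word order."""
    word = word.upper()
    scored = sorted(
        (letter_score(ch) + position_score(i, len(word)), i)
        for i, ch in enumerate(word)
    )
    keep = sorted(i for _, i in scored[:n])
    return "".join(word[i] for i in keep)


def match(misspelling, dictionary, n=4):
    """Dictionary words whose abbreviation equals the misspelling's."""
    target = abbreviation(misspelling, n)
    return [w for w in dictionary if abbreviation(w, n) == target]
```

With these assumed scores, ABSORBENT and ABSORBANT reduce to the same
four letters, so match("ABSORBANT", ["ABSORBENT", "ABSOLUTE"]) selects
ABSORBENT only.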

The results are impressive. With an abbreviation length of 4
the program recognized 89 of 117 misspellings and failed totally on
only 2 words; the author implies that an abbreviation length of 5 would
have solved the problem. The program has only two major problems:
1) examination of words not in the dictionary, and 2) gross
misspellings.

Blair designed his program specifically to correct errors made
by humans, i.e. errors produced by humans in such activities as
keypunching, and not to correct random errors such as transmission
failures. The program is less effective in correcting the latter type
of error.

2. Freeman's Work 

Freeman's method is a complex system based on the probability
that one identifier is a misspelling of another. Freeman's work involves
the use of a pseudo scoring polynomial, a technique very similar to the
artificial intelligence work with polynomial evaluation. The scoring
polynomial involves a variety of information including: 1) the number
of letters in the suspicious identifier which match an identifier in
the symbol table, 2) the number of letters which match after
transpositions and substitutions have been made, and 3) the number of
letters which match after common keypunch errors are considered. This
system is capable of detecting and correcting a great many errors but
its space and time requirements are correspondingly large.

3. Morgan's Method

Morgan suggests a less powerful method, which he claims would be 
able to correct approximately 80 per cent of all spelling errors. This 
method considers only four cases: 1) one wrong letter, 2) one letter 
missing, 3) one extra character added, and 4) two adjacent letters
transposed.
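
The four cases can be tested directly with string comparisons. The
following Python sketch is one possible reading of the method, not
Morgan's own code; the function name single_error_kind and the labels
it returns are assumptions made for illustration.

```python
def single_error_kind(word, target):
    """Classify how `word` could be a one-error misspelling of `target`.

    Returns one of "wrong letter", "missing letter", "extra letter",
    "transposition", or None if none of the four cases applies.
    """
    if word == target:
        return None
    lw, lt = len(word), len(target)
    if lw == lt:
        diffs = [i for i in range(lw) if word[i] != target[i]]
        if len(diffs) == 1:
            return "wrong letter"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and word[diffs[0]] == target[diffs[1]]
                and word[diffs[1]] == target[diffs[0]]):
            return "transposition"
        return None
    if lw + 1 == lt:
        # word lacks one letter of target
        if any(target[:i] + target[i + 1:] == word for i in range(lt)):
            return "missing letter"
        return None
    if lw == lt + 1:
        # word carries one letter too many
        if any(word[:i] + word[i + 1:] == target for i in range(lw)):
            return "extra letter"
        return None
    return None
```

Applied to the misspellings of the introduction, ROWU against ROW1 is
a wrong letter and CONTINU against CONTINUE is a missing letter.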


B. CHOOSING AN ALGORITHM 

In choosing an algorithm for implementation three factors were 
considered: 1) the complexity of coding the algorithm into XPL, 2) the 
time required to perform the test on an undeclared identifier, and 
3) the space requirements of the algorithm. The latter consideration 
is the result of large compiler overhead and the fact that most priority
systems favor smaller jobs.

In considering the first factor, one reason for the rejection of
complicated methods such as Freeman's algorithm was the lack of real
arithmetic in XPL. XPL does have good string processing and comparing
facilities which add greatly to the ease of implementing Morgan's
algorithm. The second factor, speed, was considered because of the
nature of compilers and their use. Compilers which have large use by
students need more extensive error correction such as provided by
Freeman's algorithm, while production compilers need far less correction
capability. The final requirement was space in implementing the
various algorithms. This led to conclusions similar to those of
complexity.

For the purpose of this experimental compiler Morgan's algorithm
proved to be easily implementable and quite effective. A further
consideration was the claim by proponents of the algorithm that it could
correct eighty per cent of all spelling errors, which was an acceptable
level of correction.


C. DETECTION OF MISSPELLINGS 

Detection of spelling errors is a complex problem. The problem is
compounded by the various conventions the different languages have.
Languages fall into two main classes with respect to identifiers:
1) those with all identifiers predeclared before use, and 2) those in
which a new simple variable may be declared simply by use whenever
needed. Other aspects which complicate the picture are: 1) the nature
of identifiers, i.e. attributes (simple, array or procedure), 2) the
varying types, e.g. integer, real, etc., and 3) the fact that misspelled
reserved words become identifiers. Virtually all languages require the
predeclaration of arrays. The problem of type is solved in FORTRAN-like
languages by their type conventions, while ALGOL-like languages require
predeclaration.

Different systems have different situations which suggest that an
identifier might be a misspelling. For example, in FORTRAN, FORMET (***)
would obviously be suspect, and in ALGOL, FOR <assignment> UNTL would
certainly be suspect. From the compiler's viewpoint these situations
show up as: 1) an identifier appears when a reserved word is expected;
2) an identifier is used as a type other than its declared type; 3) an
identifier appears only on the right side of assignment statements; and
4) an identifier which should have been predeclared does not appear in
the symbol table. Each of the situations is unique and requires a
special approach to solve it.

At times during the parsing of input text a reserved word is expected,
e.g. IF <boolean expression> THEN, and if the identifier THAN occurs it
can be assumed with reasonable accuracy that THEN is needed.

Type declarations also yield an obvious opportunity for detecting
misspelled identifiers, e.g. if a label TOP is declared and the label
TAP occurs, the switch to TOP would be an obvious move.

A single occurrence of a variable can lead to suspicion. If the
variable ROW1 occurs at several places in the program but in one
equation ROWU occurs on the right hand side of an <assignment statement>
and never occurs on the left of an assignment, it is likely that a
keypunching error has occurred and that ROWU should be replaced by ROW1.

When all variables are predeclared and an integer KOUNT is referenced
which does not appear in the symbol table as an integer or as any other
type, but an integer COUNT does appear, the replacement of KOUNT by
COUNT seems to be the most logical solution to the problem.

The scheme presented in this thesis uses the first, second and fourth
methods, as BALGOL requires the predeclaration of all identifiers. The
first situation is the one primarily detected by "ILLEGAL SYMBOL PAIR"
errors.


D. SPELLING CORRECTION ALGORITHM

The two main procedures (see figure 3) are MISSPELLED_IDENTIFIER and
LAFP_PROC. MISSPELLED_IDENTIFIER is the main procedure used in the
detection of misspellings. It is called every time a storage location
for an identifier is requested and no match in the symbol table is
found. The compiler used is the BALGOL-2 compiler, which is ALGOL-like
with all identifiers predeclared. MISSPELLED_IDENTIFIER is also called
from STACKING when the error "ILLEGAL SYMBOL PAIR" is encountered. This
procedure operates by a series of calls to the five working procedures
(see figure 3). The procedure has two basic parts, the first used for
misspelled identifiers and the second for misspelled reserved words.
This procedure works only if the length of the questionable identifier
is greater than one. The symbol table is searched top-down because of
the heuristic decision that the most recently declared identifier is
the most likely to be used and should be the first matched if possible.
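
The top-down search can be sketched in Python as below. This is not
the XPL procedure of figure 3: the list-based symbol table, the helper
one_error_apart (which applies the four cases of Morgan's method) and
the Python function misspelled_identifier are assumptions made only to
illustrate the search order and the length restriction.

```python
def one_error_apart(a, b):
    """True if a and b differ by one substitution, one insertion, one
    deletion or one transposition of adjacent letters."""
    if a == b:
        return False
    if len(a) == len(b):
        d = [i for i in range(len(a)) if a[i] != b[i]]
        return len(d) == 1 or (
            len(d) == 2 and d[1] == d[0] + 1
            and a[d[0]] == b[d[1]] and a[d[1]] == b[d[0]]
        )
    short, long_ = sorted((a, b), key=len)
    if len(long_) - len(short) != 1:
        return False
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def misspelled_identifier(name, symbol_table):
    """Search from the most recent declaration backwards.

    symbol_table is assumed to be a list of names in declaration order.
    Returns the first plausible replacement, or None.
    """
    if len(name) <= 1:  # names of length one are never corrected
        return None
    for declared in reversed(symbol_table):
        if one_error_apart(name, declared):
            return declared
    return None


table = ["SIDE1", "SIDE2", "ROW1", "ROW2"]
```

In this sketch misspelled_identifier("ROWU", table) yields ROW2 rather
than ROW1, because ROW2 was declared more recently and also differs by
a single letter; the search order alone cannot settle such a case.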

The second calling procedure is LAFP_PROC, which is a special purpose
procedure used only when one of the special reserved words LOCAL, ARRAY,
FUNCTION, PROCEDURE, BEGIN or END is expected. LAFP_PROC is called from
STACKING and is called only after the detection of the head symbol BEGIN.
Once a BEGIN is detected an identifier in token will be stacked
automatically, but if the identifier is really a reserved word of this
special class then great damage is done, e.g. BEGIN LOCEL A would cause
the identifier A not to be declared properly. This damage would have to
be corrected if this situation were not handled specially, e.g. if LOCAL
is misspelled then an extra reduction is made and parsing fails. In
order to avoid removal of code which has already been generated, a test
is applied to all occurrences of BEGIN followed by an identifier.


E. ACTUAL IMPLEMENTATION

Spelling correction involves the selective calling of
MISSPELLED_IDENTIFIER and LAFP_PROC when a suspicious identifier is
encountered or when an "ILLEGAL SYMBOL PAIR" is detected. Suspicious
identifiers are detected when the PRT address of an identifier is needed
and the identifier does not occur in the symbol table. This is detected
in the procedure LOOKUP, which is designed to locate identifiers in the
symbol table and return their PRT address. BALGOL-2 is a block
structured language with integers, integer arrays, functions and
procedures only, and requires the declaration of all identifiers before
use. No labels are allowed, thus assuring the detection of misspellings
by reaching symbol table entry 0 without a match.

Misspelled identifiers are corrected with relative ease. The error
message "*** ERROR *** MISSPELLED IDENT ... REPLACED BY ..." is
emitted. The PRT address of the symbol table entry matched is then
returned as the address of the misspelled identifier.

Misspelled reserved words are more difficult to detect, since they
take on the appearance of identifiers and often get to the parse stack
before being detected. The nature of reserved words allows them to take
on the appearance of many other legal combinations not covered by
LAFP_PROC, e.g. WRITE (A,B); has the same basic structure as PROCED (A,B);
where PROCED is the invocation of a procedure. Thus the misspelling
WRIT (A,B) could cause erroneous parsing by allowing an identifier to
appear on the stack in place of the reserved word WRITE. Misspelled
reserved words can also cause illegal symbol pairs, e.g. if WHILE A LSS B
were to appear as WHIL A LSS B then the occurrence of two identifiers
together would cause the detection of the misspelling.

The problem of misspelled identifiers could be attacked by a call 
to LOOKUP from the procedure SCAN whenever an identifier is detected on 
the input stream. This method has one basic drawback, the time required
to look up every identifier.

Actual correction of misspelled reserved words not covered in
LAFP_PROC is performed in the procedure STACKING, which determines how
many input symbols are to be placed on the stack before attempting a
reduction.

STACKING makes its decision from the top of the parse stack and the
symbol in token. If an illegal symbol pair is detected then one attempt
is made to resolve the conflict, provided either the top element of the
parse stack or token is an identifier. This is done by checking the top
of the stack first for misspellings and correcting through the use of a
case statement on all the reserved words. If this fails the token value
is then checked and corrected in the same manner. If both fail other
methods must be employed.
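
The order of the two checks can be sketched in Python. The sketch
below is an assumption-laden illustration, not the STACKING procedure:
symbols are modelled as (kind, text) pairs, the RESERVED list is
abbreviated, and the standard library function
difflib.get_close_matches stands in for the case statement on the
reserved words.

```python
import difflib

# Abbreviated, assumed list of reserved words.
RESERVED = ["IF", "THEN", "WHILE", "WRITE", "READ",
            "BEGIN", "END", "UNTIL", "DO", "FOR"]


def correct_reserved(text):
    """Closest reserved word, or None. difflib is a stand-in matcher."""
    hits = difflib.get_close_matches(text, RESERVED, n=1, cutoff=0.7)
    return hits[0] if hits else None


def resolve_illegal_pair(stack_top, token):
    """Make one attempt to repair an illegal symbol pair.

    The stack top is examined first, then the token. Returns the
    repaired (stack_top, token) pair, or None if both checks fail.
    """
    for which, (kind, text) in (("top", stack_top), ("token", token)):
        if kind != "identifier":
            continue
        word = correct_reserved(text)
        if word is not None:
            fixed = ("reserved", word)
            return (fixed, token) if which == "top" else (stack_top, fixed)
    return None
```

For WHIL A LSS B, the pair of identifiers WHIL and A is repaired by
replacing the stack top with the reserved word WHILE; for two genuine
identifiers such as A and B the function returns None and other
methods would be needed.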


F. OTHER CONSIDERATIONS FOR SPELLING CORRECTION 

In general there are two possible considerations which directly
affect the ease of detecting misspellings. The first is already
implemented in the BALGOL-2 compiler, i.e. predeclaration of all
variables by type. This allows easy detection of misspellings and
facilitates correction. The second consideration comes when writing the
BNF for the language, namely never allow <identifier> <identifier> or
<identifier> terminal to be legal if reserved word terminal is legal,
e.g. do not allow <identifier> "(" to be either a procedure invocation
or the misspelling of READ, WRITE, IF, WHILE or CASE. This could be
accomplished by enclosing boolean expressions for an IF, WHILE or CASE
statement in vertical bars rather than "(", using broken brackets,
<, >, around array subscripts, and using exclamation points to delimit
lists of input/output statements. This would allow misspellings to be
detected as illegal symbol pairs rather than after several reductions
have been made. Although this method would make error correction easier
it might lead to additional errors due to the large symbol set.


III. ERRORS OF INSERTION/DELETION


A. INITIAL COMMENTS 

The basis for the correction of errors of insertion or deletion is 
a data base which provides information about the local context at the 
point where the error was detected. In this system the context consists
of the relationships between terminals.


B. PREPARATION OF A DATA BASE 

In preparing a data base for the compiler's use, several considerations
were studied: first, what information was available from the tables
produced by the XPL analyzer; second, how the XPL compiler presently
detects errors; third, the meaning of these errors in terms of code
generation; and finally, the size requirements and access time of the
data base.

The XPL analyzer produces a great volume of information, including
a vocabulary list and a C1 stacking decision array. It is around this
array that all parsing in the compiler revolves. The C1 array contains
one position for each element of the set {NS X NT}, where NS is the
number of symbols in the vocabulary and NT is the number of terminals.
The value of each position is 0, 1, 2 or 3 and indicates the stacking
decision for the compiler. Thus the first method of error recognition
is the occurrence of two symbols, X and Y, where the position (X,Y) of
C1 is equal to zero. This also leads to a starting point for error
correction. These errors may be correctable by insertion or deletion of
proper symbols.
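
The role of the zero entries can be shown with a toy table in Python.
Everything in this sketch is assumed for illustration: the four-symbol
vocabulary, the contents of the table, and the names given to the
codes 1, 2 and 3 (stack, do not stack, check triples). Only the
meaning of 0 as an illegal symbol pair is taken from the text.

```python
# Hypothetical miniature of the C1 stacking decision array; the real
# table is produced by the XPL analyzer and is far larger.
TERMINALS = ["<identifier>", "+", ";"]      # the NT column symbols
SYMBOLS = TERMINALS + ["<expression>"]      # the NS row symbols

# Assumed reading of the four codes.
ILLEGAL, STACK, NO_STACK, CHECK_TRIPLES = 0, 1, 2, 3

# One row per vocabulary symbol, one column per terminal.
C1 = [
    # <identifier>  +         ;
    [ILLEGAL,       NO_STACK, NO_STACK],    # <identifier>
    [STACK,         ILLEGAL,  ILLEGAL],     # +
    [STACK,         ILLEGAL,  ILLEGAL],     # ;
    [ILLEGAL,       STACK,    STACK],       # <expression>
]


def stacking_decision(stack_top, token):
    """Look up position (X, Y) of C1 for stack symbol X and terminal Y."""
    return C1[SYMBOLS.index(stack_top)][TERMINALS.index(token)]


def is_illegal_symbol_pair(stack_top, token):
    """An entry of zero is the first method of error recognition."""
    return stacking_decision(stack_top, token) == ILLEGAL
```

Here two adjacent identifiers index a zero entry and are reported as
an illegal symbol pair, which is the point at which an insertion or
deletion could be attempted.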

The XPL analyzer also produces the arrays HDTB, PRDTB and PRTB, which
hold the productions of the grammar and are used in making the
reductions in the parsing of statements. These arrays along with the
context arrays CONTEXT_CASE, CONTEXT_TRIPLE, and LEFT_CONTEXT are used
to decide at a given stage if any reduction can be made and, if so,
which production should be applied. This yields a second type of error,
"NO PRODUCTION APPLICABLE". This type of error occurs when terminals
which make sense in a stacking decision appear together and are
stacked. These errors are far more complicated. Their correction would
require a compiler to reverse its parser, wipe out code which it had
already generated, make an evaluation of the error and alter the input
stream before reparsing the erroneous input. The "ILLEGAL SYMBOL PAIR"
error, once detected, may require only the insertion of a symbol
between the present stack top and token symbols, e.g.
<identifier> <identifier> could be corrected by the insertion of a "+",
or the deletion of a symbol, as in the case of <number> <number>, e.g.
STEP 1 1 UNTIL, which could be corrected by the deletion of the second
one. This technique doesn't guarantee results but it could lead in many
cases to the continuation of the parsing and the possible proper
execution of the program.

The final consideration is the space required and access time of the 
data base. Obviously too much space or too much time would be detrimental 
to such a system. 

There are four basic considerations in developing a data base for 
compiler correction of errors: 1) available information, 2) present 
error detection facilities, 3) meaning of these errors in terms of 
correction, and 4) space limitations. All these considerations must be
taken into account when developing a base of information to be used when
correcting errors.


C. INTRODUCTION TO THE ANALYZER

Using the tables produced by the XPL analyzer, three inputs are
prepared for the new syntax analyzer produced as part of this paper.
The first input is a simple list of the vocabulary for the grammar, all
terminals followed by the nonterminals. The second is the list of
productions for the grammar, which is in the same form as for the XPL
analyzer. Finally, the table of pseudo legal triples, which is produced
by an XPL program, is prepared. A pseudo legal triple is defined
to be a string of three terminals, A B C, such that the pair A C is an
illegal symbol pair and a stacking decision exists both between A and
B and between B and C, e.g. <identifier> + <identifier> is actually a
legal triple, while ";" <identifier> ")" is a pseudo legal triple but
cannot be generated from the grammar. The legal triples among these
pseudo legal triples will give valuable information on the possible
insertions between terminals.
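
The definition of a pseudo legal triple translates directly into a
filter over all strings of three terminals. The Python sketch below is
an illustration under assumed data: the four terminals and the
LEGAL_PAIRS relation are invented, and has_decision stands in for a
nonzero entry of the stacking decision array.

```python
from itertools import product

TERMINALS = ["<identifier>", "+", ";", ")"]

# Assumed toy relation: pairs of terminals for which a stacking
# decision exists. Any pair not listed is an illegal symbol pair.
LEGAL_PAIRS = {
    ("<identifier>", "+"), ("+", "<identifier>"),
    ("<identifier>", ";"), (";", "<identifier>"),
    ("<identifier>", ")"),
}


def has_decision(a, b):
    return (a, b) in LEGAL_PAIRS


def pseudo_legal_triples(terminals=TERMINALS):
    """All A B C where A C is illegal but A B and B C are both legal."""
    return [
        (a, b, c)
        for a, b, c in product(terminals, repeat=3)
        if not has_decision(a, c) and has_decision(a, b) and has_decision(b, c)
    ]


triples = pseudo_legal_triples()
```

With this toy relation the result contains both
<identifier> + <identifier> and ";" <identifier> ")"; separating the
triples that the grammar can really generate from those it cannot is
the job of the analyzer and is not attempted here.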

The basis for the analyzer is a suggestion by David Gries [5] in
his book Compiler Construction for Digital Computers. Gries suggests
four possible methods of recovering from a given situation (x1,T):

1. If there exists a rule U ::= x1 z T ... in the grammar, a
terminal string q should be inserted such that z ->* q.

2. If there exists a rule U ::= x1 V ... such that V -> z T ..., a
terminal string q should be inserted such that z ->* q.


3. If there exists a rule U ::= ... V z2 T ... such that V ->* ... W
and W ::= x1 z1 is a rule, a terminal string q should be
inserted such that z = z1 z2 ->* q.


4. If none of the above apply, T should be deleted. 


D. THE ANALYZER 

The syntax analyzer produces a large amount of information which is
useful in the analysis of the problem of insertions/deletions, but the
basic output is the list of legal triples which were extracted from the
list of pseudo legal triples. The other major outputs are the right
sets and right pair sets of all the nonterminals as well as all legal
triples. The right sets and right pair sets of a nonterminal are the
sets of terminals generated at the far right of a derivation from that
nonterminal. Each output builds on the previous output; the right sets
are used to generate the right pair sets and both are used to generate
the legal triples.

A procedure which is typical of the recursive procedures on which 
the analyzer is based is BRS. This procedure is given in figure 4, with 
a flow chart shown in figure 5. The procedure's one calling argument 
is an element of the vocabulary. 
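
A recursive computation of a right set can be sketched in Python as
follows. This is not the BRS procedure of figure 4: the dictionary
representation of the grammar, the toy productions, the assumption
that no production is empty, and the function name right_set are all
choices made for this illustration.

```python
def right_set(symbol, grammar, _visiting=None):
    """Terminals that can end a string derived from `symbol`.

    `grammar` maps each nonterminal to a list of productions, each a
    non-empty list of symbols; anything that is not a key is treated
    as a terminal.
    """
    if symbol not in grammar:        # a terminal ends its own string
        return {symbol}
    visiting = _visiting if _visiting is not None else set()
    if symbol in visiting:           # already expanded in this call
        return set()
    visiting.add(symbol)
    result = set()
    for production in grammar[symbol]:
        # Only the rightmost symbol of a production can end the string.
        result |= right_set(production[-1], grammar, visiting)
    return result


# Assumed toy grammar, for illustration only.
GRAMMAR = {
    "<statement>": [["<identifier>", ":=", "<expression>"]],
    "<expression>": [["<expression>", "+", "<term>"], ["<term>"]],
    "<term>": [["<identifier>"], ["<number>"], ["(", "<expression>", ")"]],
}
```

The single calling argument is an element of the vocabulary, as in the
text; the extra set of visited nonterminals only guards the recursion
against productions that refer back to a symbol being expanded.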

The basic aim of the analyzer is to provide output which the compiler
writer, and hopefully later a more complicated analyzer, can use
heuristically to provide further input to the compiler tables. This
input would provide: 1) information to the compiler when it is no longer
able to continue the parsing, 2) information on possible insertions
which could be used to continue the parse and hopefully in some cases
provide a correct solution, and 3) deletion input by default, e.g. if no
insertion is legally possible then delete the value in token.


E. HEURISTICS 

Although the analyzer was able to cut the number of pseudo legal
triples for the BALGOL test grammar in half, there were still over 1500
legal triples remaining. The basic heuristics used to eliminate legal
triples, after the analyzer has removed all nonlegal triples, must be
such that they eliminate all duplication of equivalent terminals.
Equivalent terminals are two terminals which have the same roles,
i.e. produce the same parse or essentially the same parse when inserted
between two terminals, e.g. in most grammars, identifiers and numbers
play the same role in arithmetic expressions. These equivalent terminals
should be generated by an algorithm similar to the one which generates
singles of nonterminals.

The basis for the deletion of legal triples must be the elimination 
of all triples which are equivalent in an insertion set. The insertion 
set of two terminals, A and B, is defined as the set of all terminals
which could be legally inserted between A and B.
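
Insertion sets follow directly from a list of legal triples, and the
elimination of equivalent terminals amounts to replacing each terminal
by one representative of its class. The Python sketch below assumes
both the sample triples and the REPRESENTATIVE table; neither is taken
from the BALGOL grammar.

```python
from collections import defaultdict


def insertion_sets(legal_triples):
    """Map each pair (A, B) to the terminals insertable between them."""
    sets = defaultdict(set)
    for a, middle, b in legal_triples:
        sets[(a, b)].add(middle)
    return dict(sets)


# Hypothetical equivalence classes: each terminal maps to the single
# representative chosen for its class.
REPRESENTATIVE = {
    "+": "+", "-": "+", "*": "+", "/": "+",
    "<identifier>": "<number>", "<number>": "<number>",
}


def reduce_set(insertion_set, representative=REPRESENTATIVE):
    """Collapse equivalent terminals onto their representatives."""
    return {representative.get(t, t) for t in insertion_set}


# Assumed sample of legal triples.
LEGAL_TRIPLES = [
    ("<identifier>", "+", "<identifier>"),
    ("<identifier>", "-", "<identifier>"),
    ("<identifier>", "*", "<number>"),
    ("+", "<identifier>", ";"),
    ("+", "<number>", ";"),
]
SETS = insertion_sets(LEGAL_TRIPLES)
```

In this sample the insertion set between two identifiers, {+, -},
shrinks to {+}, so only one triple per pair of terminals would need to
be kept in the compiler tables.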


F. RESULTS 

Two large grammars were run with the analyzer, the BALGOL-2 grammar 
used in the spelling correction compiler and the ALGOL-E grammar written 
by Gary Kildall. Both grammars are similar in terms of their basic 
definitions. The BALGOL grammar has 142 symbols in the vocabulary, 56 
terminals and 157 productions, while the ALGOL-E grammar has 121 symbols, 
49 terminals and 132 productions. 

When the pseudo legal triples were produced for BALGOL there were 
2995 pseudo legal triples. ALGOL-E had 2396. After being processed by 
the analyzer only 1519 triples remained for BALGOL and only 1308 for 
ALGOL-E. This reduction is roughly half, 50.3% legal for BALGOL and 
54.6% legal in ALGOL-E. 

Now consider the heuristics that the compiler writer generates to 
reduce the number of triples which will actually be added to the 
compiler tables. The triples would then be consulted by an error 
subroutine to add or delete terminals from the input text to allow the 
parsing to continue. 

The following are the simple heuristics which were applied to the 
legal triples output by the analyzer. 


1.  If the operators +, -, *, / and exponentiation are all legal 
    insertions, use + and eliminate the others. 

2.  If the logical operators AND and OR are both legal, use OR and 
    delete AND. 

3.  If the logical comparatives LSS, LEQ, EQL, NEQ, GEQ, and GTR 
    are all possible insertions, use LSS and delete all others. 

4.  If any subset of these operators or comparatives appears in an 
    insertion set, choose one and delete all others. 

5.  The subset {<identifier>, <number>} is reduced to the set 
    {<number>} with the insertion of 1 following the operators *, 
    / or exponentiation and 0 elsewhere. 

6.  Wherever the set {<procedure invocation>, <simple procedure 
    invocation>} appears in an insertion set, delete <procedure 
    invocation> and insert TRACE as the terminal. 

7.  The set {<operation>, <data descriptor>} is reduced to the set 
    {<operation>} with a trace being generated. 

8.  The set {<identifier>, <string>} is reduced to {<string>} with 
    "ERROR INSERTION" being the insertion. 

9.  The set {<identifier>} is totally deleted with a special 
    procedure to handle this case. 
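
The operator and operand heuristics above might be applied to a 
single insertion set roughly as follows. This is only a sketch: the 
function names are hypothetical, "**" is an assumed stand-in for the 
exponentiation operator, and the alphabetical fallback used when the 
preferred terminal is absent is an arbitrary choice standing in for 
"choose one". The heuristics involving TRACE, "ERROR INSERTION" and 
the special procedure for {<identifier>} are not modeled. 

```python
ARITHMETIC = {"+", "-", "*", "/", "**"}  # "**" is an assumed spelling
LOGICAL = {"AND", "OR"}
COMPARATIVES = {"LSS", "LEQ", "EQL", "NEQ", "GEQ", "GTR"}


def keep_one(insertion_set, group, preferred):
    """Collapse all members of `group` in the set to one representative."""
    found = insertion_set & group
    if len(found) < 2:
        return set(insertion_set)
    # Arbitrary tie-break (assumption): alphabetical order.
    keeper = preferred if preferred in found else min(found)
    return (insertion_set - found) | {keeper}


def reduce_insertion_set(insertion_set):
    """Apply the operator, comparative and operand heuristics."""
    reduced = keep_one(insertion_set, ARITHMETIC, "+")
    reduced = keep_one(reduced, LOGICAL, "OR")
    reduced = keep_one(reduced, COMPARATIVES, "LSS")
    # <identifier> and <number> play the same role; keep <number>.
    if {"<identifier>", "<number>"} <= reduced:
        reduced.discard("<identifier>")
    return reduced


print(reduce_insertion_set({"+", "-", "*", "<identifier>", "<number>"}))
```

Each group is collapsed independently, so a set containing both an 
arithmetic operator and an operand still keeps one terminal of each 
kind rather than a single element. 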


After applying these nine heuristics to BALGOL the insertion triples 
list was reduced by 460 to 1059, a reduction of 30.3%, while in ALGOL-E 
the reduction was 490 to 818, a reduction of 37.6%. BALGOL contains 
713 insertion sets and ALGOL-E has 566. Of these 713 originally only 
221 had unique insertions while after the use of the heuristics this 
number grew to 509, i.e. 71% of the insertion sets have a unique 
insertion. ALGOL-E showed a similar increase with 173 initially and 414 
unique insertions after the application of the heuristics, i.e. 73% have 
only one insertion possible. These reductions can best be seen in 
figure 6, a table showing the relative numbers of the insertion sets 
before and after use of the heuristics. Thus of the original 2996 
pseudo legal triples only 1059 triples remain to be considered, a 
reduction of 64.8%. ALGOL-E originally had 2396 triples. However, only 
818 remain, a reduction of 65.9%. 

In developing the heuristics used to accomplish the above reductions 
the grammar and insertion sets were examined to determine which terminals 
played equivalent roles in the productions, e.g. <identifier>, <number> 
in arithmetic expressions. When equivalent terminals were detected all 
but one were eliminated from the insertion set and other similar 
insertion sets were also examined. The equivalent terminal to be retained 
was determined as follows: 1) if the terminals were simple terminals, 
e.g. =, -, ( etc., simple choice was used; or 2) if the terminals had 
attributes, e.g. <number>, a value, or <identifier>, a character 
representation, the simplest attribute to determine was used, e.g. 
between <number> and <identifier> the <number> was chosen since the 
attribute 1 could always be assigned to it. These equivalent terminals 
were initially determined by consulting the singles table; a single of a 
nonterminal x is defined to be a terminal t such that x →* t. These 
methods seem to be such that they could be included in an analyzer and 
used to reduce all insertion sets to a single element. The terminal 
would be used every time the situation occurred. If it was successful 
the parse would continue; if not, other actions would have to be taken. 
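
The singles table referred to above could be computed along the 
following lines. This is a sketch under two assumptions that are not 
stated in the text: the grammar has no empty productions, and a 
production is represented as a (left part, right part) pair. The 
function name and the grammar fragment are hypothetical. 

```python
def compute_singles(productions, terminals):
    """Map each nonterminal x to the terminals t with x ->* t.

    Assumes no empty productions, so only right parts consisting of
    exactly one symbol can derive a string of one terminal.
    """
    singles = {left: set() for left, _ in productions}
    changed = True
    while changed:
        changed = False
        for left, right in productions:
            if len(right) != 1:
                continue
            symbol = right[0]
            if symbol in terminals:
                new = {symbol}
            else:
                new = singles.get(symbol, set())
            if not new <= singles[left]:
                singles[left] |= new
                changed = True
    return singles


# Hypothetical grammar fragment for illustration only.
productions = [
    ("<expression>", ("<term>",)),
    ("<expression>", ("<expression>", "+", "<term>")),
    ("<term>", ("<primary>",)),
    ("<primary>", ("<identifier>",)),
    ("<primary>", ("<number>",)),
]
terminals = {"<identifier>", "<number>", "+"}
singles = compute_singles(productions, terminals)
print(singles["<expression>"])
```

Terminals that are singles of the same nonterminal, such as 
<identifier> and <number> here, are candidates for being treated as 
equivalent. 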
NUMBER OF INSERTION SETS CONTAINING N ELEMENTS 


             BALGOL             ALGOL-E 
  N      BEFORE   AFTER     BEFORE   AFTER 

  0         0       53         0       48 
  1       221      509       173      414 
  2       398       73       326       51 
  3        25       14        18        7 
  4        44       39        27       24 
  5         5        5         5        5 
  6         2        2 
  7         4        4         0        0 
  8         0        0         0        0 
  9         0       10         0        9 
 10         4        0         0        0 
 11         0        4         0        5 
 12         0        0         0 
 13         0        0         0 
 14         6        0         0        0 
 15         0        0         3        0 
 16         4        0         0        0 
 17         0        0         1        0 
 18         0        0         0 
 19         0        0         6        0 
 20         0        0         0        0 
 21         0        0         4        0 
 22 

TOTAL     713      713       566      566 


Table of insertion sets 


figure 6 
IV. CONCLUSION 


Figure 1 is a prime example of a program which must needlessly be 
resubmitted because it contains only minor spelling errors. These 
errors are an area of major frustration to new users and to students 
who are unfamiliar with keypunches. 

The problem of misspelled identifiers is covered in this paper as 
well as some initial work on the problem of errors of insertion/deletion. 
The problem of missing terminals in the input text is a very complicated 
one. The compiler must first generate any logical insertion, and if 
more than one insertion is possible, it must decide which terminal to 
use and when to abandon the path and try again. The basic analyzer to 
produce a first input is presented here. This analyzer could be the 
basis for a future, more complete error correcting compiler. 

I feel the compiler produced could be a model for the procedures 
which could be inserted into any XPL based compiler with predeclaration 
and could easily be modified for FORTRAN-like compilers. I feel such 
routines should be basic to any nonproduction compiler or compiler used 
primarily by students. I believe the analyzer is a good first step 
toward the solving of the problem of implementation of Gries' 
suggestions and toward the basic aim of compilers which can cope with 
almost any problem they might encounter. 







V. TOPICS FOR FURTHER CONSIDERATION 


A. DATA BASE IMPLEMENTATION 

The most obvious extension would be the implementation of this data 
base concept in the XPL system. This would involve outputting 
numerical data from the analyzer into another interface program to 
produce table input acceptable to XPL. The data base might be built 
as follows. There could be two index matrices of dimension NT which 
index into the data base by the token and top of the stack values, e.g. 
TOKEN INDEX would locate the base of the section containing information 
about the token and STACK INDEX would locate the subsection associated 
with the top element of the stack. The basic data base could be stored 
in an integer (fixed) array with the 0th byte empty, the 1st byte 
containing the stack element, the 2nd byte containing the insertion and 
the 3rd byte containing the token value. 
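
The proposed word layout can be sketched as follows. The sketch 
assumes 32-bit words whose bytes are numbered from the most 
significant end, and symbol codes that fit in one byte; all function 
names are hypothetical. A plain linear scan stands in for the TOKEN 
INDEX and STACK INDEX lookups, which are not modeled. 

```python
def pack_entry(stack_symbol, insertion, token):
    """Pack three one-byte symbol codes into one word; byte 0 stays empty."""
    for code in (stack_symbol, insertion, token):
        if not 0 <= code <= 0xFF:
            raise ValueError("symbol codes must fit in one byte")
    return (stack_symbol << 16) | (insertion << 8) | token


def unpack_entry(word):
    """Return (stack_symbol, insertion, token) from a packed word."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF


def find_insertions(data_base, stack_symbol, token):
    """Linear scan standing in for the two index matrices."""
    return [
        ins
        for stack, ins, tok in map(unpack_entry, data_base)
        if stack == stack_symbol and tok == token
    ]


# Hypothetical symbol codes for illustration only.
data_base = [pack_entry(5, 17, 9), pack_entry(5, 18, 9), pack_entry(6, 17, 9)]
print(find_insertions(data_base, 5, 9))
```

Keeping one insertion per word means that an insertion set with 
several alternatives simply occupies several consecutive entries. 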

Actual implementation in the compiler would not be an easy matter 
since whenever making an insertion with more than one alternative the 
entire context must be saved. Once the insertion has been made a method 
of determining its correctness must be found. If the insertion is not 
successful then a method must be devised to restore the system to its 
original state. Finally if no insertion leads to a successful 
continuation of the parse then a more effective method must be found. A 
possible method might be the cleaning of the stack back to the last ";", 
inserting a statement and then discarding text until the next ";" is 
encountered. 
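
The save, try and restore cycle described above might look like the 
following sketch. It is not the XPL mechanism: the context is modeled 
as an ordinary Python object, each trial works on a deep copy so the 
original state never needs explicit restoring, and the test of 
correctness is a caller-supplied function. All names, including the 
toy legality check, are hypothetical. 

```python
import copy


def try_insertions(context, candidates, parse_continues):
    """Try each candidate insertion on a saved copy of the context.

    Returns (insertion, new_context) for the first candidate that lets
    the parse continue, or (None, context) if none does.
    """
    for insertion in candidates:
        trial = copy.deepcopy(context)  # save the entire context
        if parse_continues(trial, insertion):
            return insertion, trial
        # The failed trial copy is dropped; `context` is untouched.
    return None, context


# Toy stand-in for the parser: an insertion is accepted when it fits
# between the top of the stack and the next token.
LEGAL_PAIRS = {("<identifier>", ":="), (":=", "<number>")}


def toy_parse_continues(ctx, insertion):
    ctx["stack"].append(insertion)
    return (
        (ctx["stack"][-2], insertion) in LEGAL_PAIRS
        and (insertion, ctx["text"][0]) in LEGAL_PAIRS
    )


context = {"stack": ["<identifier>"], "text": ["<number>"]}
chosen, new_context = try_insertions(context, ["+", ":="], toy_parse_continues)
print(chosen)
```

When the function returns None the caller would fall back on a 
cruder recovery, such as the discard to the next ";" suggested above. 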







B. EXAMINE MORE CONTEXT 

One method of making the analyzer more effective in identifying 
unique insertions and building a data base for correction of 
insertion/deletion errors would be to expand the analyzer to look at 
more local context, perhaps quadruples in place of triples. 


C. MULTIPLE INSERTIONS/DELETIONS 

Presently the analyzer is set to develop only single insertions but 
in some cases several symbols may need to be inserted or deleted in 
order to allow parsing to continue properly. An analyzer might be 
developed to aid in the solution of this problem. 


D. SPECIAL GRAMMARS 

One area of interest which has merit for further studies is the area 
of special error correcting grammars which give a maximum of local 
context at all times. These grammars should eliminate as many multiple 
insertions as possible. 


E. HEURISTICS TO ELIMINATE TRIPLES 

Further development could well be done with the heuristics used to 
eliminate triples. These heuristics might also be added in some way 
to the analyzer. 


F. MORE POWERFUL SPELLING CORRECTION 

The spelling correction might be made as powerful as desired, with 
Freeman's method being, perhaps, the most powerful. The present routines 
could be made more powerful by the insertion of a routine to handle 
normal upper/lower case spelling errors with more guaranteed results. 